Introduction to Colossus

Understand the importance of Colossus and its high-level design.

Colossus is the descendant of the Google File system (GFS). Let's examine why Google needed to develop Colossus (a new file system) while GFS was already there.

Disclaimer: Google hasn’t revealed the full design of the Colossus system, so there are many open questions. For some of them, we have speculatively given answers, while for others, we encourage learners to consider them on their own.

Need for Colossus#

GFS was built to meet the requirement of storing a few million files with a total of hundreds of terabytes of data. For this scale, it was sufficient to have a single manager with some workload optimizations. It helped Google develop a file system for its users much more rapidly, which wouldn’t have been possible with a distributed manager-based design. Distributed models are complex and it takes time to design such systems.

All of Google's applications, including Google Cloud Storage services, use Google's own file system. The growing number of Google's applications and Cloud users has led to massive growth in data needs, requiring a file system that can scale to multiple petabytes (exabytes). GFS can't scale to exabytes due to a single manager. A single manager is responsible for managing metadata, storing metadata in memory and on the manager node's local disk, scanning through metadata for garbage collection and fault recovery, and handling metadata requests from all clients.

An increase in the volume of data from a few hundred terabytes to multiple petabytes makes it difficult for the single manager to maintain metadata at such a large scale. The metadata storage requirements also grow with the volume of data. Scanning through a large volume of metadata adds to the operation cost. Applications like MapReduce request many files simultaneously, generating a large number of metadata requests at the manager simultaneously. A single manager capable of performing thousands of operations per second becomes a bottleneck for such clients.

The Google Cloud services (the services that all run on the same file system infrastructure) scale when the underlying file system scales
The Google Cloud services (the services that all run on the same file system infrastructure) scale when the underlying file system scales

At the time GFS was being developed, most workloads had high throughput requirements rather than low latency. Therefore, Google focused on providing a high throughput file system. The GFS design is not good for latency-sensitive applications because of the single point of failure. There is a single manager for all the metadata operations. Even if GFS had the mechanism to recover the manager, it might still be unavailable for a minute. This downtime is not a significant problem for batch-oriented applications that require high throughput and can bear a latency of a few seconds. However, for applications like video serving, downtime is unacceptable.

In addition to manager failure, chunkserver failures also add to the latency. This is because if the number of replicas for a chunk falls below a certain number, the manager prioritizes making a copy of that chunk first from a valid replica (it may take a few seconds). It then passes the control back to the client to continue writing on the chunk. Batch operations that last between a few minutes to several hours may not register a lag of a few seconds. This may be a serious problem for streaming or real-time operations. For user-facing applications, a downtime of even a few seconds is not acceptable.

Let's summarize the new requirements added to the underlying file system of Google's applications and Google Cloud clients.

Requirements#

The basic functions of the Colossus file system are the same as those of GFS, so let’s focus on the following additional requirements.

  • Scalability to exabytes: Due to the accelerating volume of data generated by an increasing number of applications, it was not possible to manage data at such a big scale by using GFS as the underlying file system for all applications.
  • Low latency: Interactive applications like video gaming, online meetings (including Google Meet), video serving, and others require a response in real-time. We know that the GFS metadata service, because of a single point of failure, can go down for a minute, so we have to come up with a file system that is highly available.

    Video serving in comparison to other interactive applications can be handled with GFS because we are streaming data and we can buffer it.

To meet the requirements above, Google developed a file system that is an extension of GFS called Colossus. Scalability and latency issues were highlighted by GFS's centralized metadata model. As a result, Colossus introduced a distributed metadata model that is more scalable and highly available. The disaggregation of a metadata store and increased availability are the major differences between GFS and Colossus and is shown in the following illustration.

A single manager-based metadata model (GFS) vs. a distributed metadata model (Colossus)
A single manager-based metadata model (GFS) vs. a distributed metadata model (Colossus)

Let's look at the overall architecture of Colossus.

Colossus architecture#

The Colossus file system is designed to withstand the enormous growth in data requirements of a continually increasing number of applications. It consists of the following components:

  • Client library: The Colossus client library is the most complex part of the Colossus file system. It plays the same role as the GFS client in the Google File System. It enables applications or services to interact with Colossus. There is a lot of functionality in the client library that helps applications fine-tune performances according to their requirements and make trade-offs for different workloads. For example, clients can choose to use either full data replication or Reed-Solomon encoding-based storage (efficient in storage use but at the expense of encoding and decoding processing).

In a full replication, there will be an N number of data copies. For erasure codes such as Reed-Solomon, parity bits are added. In most cases, storage use is less than full replication, but the catch is that for writing, we need to burn the CPU in order to encode, and for reading, we again need to burn the CPU to do the decoding.

  • Control plane: A control plane consists of front-end metadata servers called curators that are horizontally scalable. Curators carry out the control operations such as file creation operations from the clients. There can be hundreds of manager nodes, each managing 100 million files. We can reduce the latency by moving to a distributed-manager file system.

Architecture of Colossus
Architecture of Colossus
  • Metadata database: All curators keep their metadata in Bigtable, Google's high-performance database. This makes it easier to work around the scale issues presented in the GFS single manager design for metadata. By storing file metadata in Bigtable, Colossus scales over the largest GFS clusters by more than 100x.

  • Custodians: These are background storage managers that perform tasks like garbage collection, replication, re-replication, and rebalancing. Custodians ensure data durability and availability guarantees.

  • D file servers: These are network-attached disks where the actual data bytes are stored. Data flows directly between clients and D file servers just like GFS.

Mapping GFS components to Colossus#

The replacement of GFS components with Colossus components is shown in the illustration below. The fundamental replacement is the Colossus distributed metadata service.

Mapping GFS components to Colossus’s file system components
Mapping GFS components to Colossus’s file system components

In this lesson, we’ve seen the limitations of GFS's scalability and availability for a growing number of applications and data. The major bottleneck at GFS was its single manager metadata service. Colossus introduced a distributed metadata model to cope with the growing demands. We’ve seen the high-level design of Colossus and will dig into the details of how it all works in the next lesson.

Note: Most of thechapters have a Bird's Eye view section to summarize the chapter and mark a milestone in our learning journey. Due to the small size of this chapter, we haven't included that Bird's Eye View section here.

Quiz on GFS

Design and Evaluation of Colossus